Corpora in machine translation

Author

  • Hanne Moa
Abstract

In spite of this quote, many machine translation systems are in use today and more are being built, as the need for translation is seemingly boundless. For instance, the contracts, agreements, laws and parliamentary sessions of the EU need to be translated somehow. Even if machine translation as good as human translation is infeasible, which is what Bar-Hillel was concerned about, quickly made, partial and rudimentary translations can still help a translator, or help decide which texts need to be properly translated in the first place. Intriguingly, it turns out that corpora can help with the second claim of the quote, that “Computer understanding of text is too difficult.” Corpora can provide some understanding of the world simply by being a source for deriving frequencies and other statistically significant phenomena, such as collocations and words that don’t follow the rules. In this paper I will zoom in from the general to the specific, going from the past up to today. Section 2 looks at early use of corpora and machine translation (hereafter MT), section 3 is about modern MT and its growing dependence on corpus linguistics, and finally section 4 is about a specific MT project that is still under development and its use of corpora, namely LOGON (Lønning et al., 2004).
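The abstract's remark that corpora can be a source for frequencies and collocations can be illustrated with a small sketch. It scores adjacent word pairs by pointwise mutual information (PMI), which is one common collocation measure and not necessarily the one used in the paper. The function name `bigram_pmi`, the `min_count` threshold and the toy corpus are assumptions made purely for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    Illustrative only: probabilities are plain relative frequencies,
    and pairs seen fewer than `min_count` times are ignored because
    PMI is unreliable for rare events.
    """
    unigrams = Counter(tokens)                  # word frequencies
    bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent-pair frequencies
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI compares the observed pair probability with what would be
        # expected if the two words occurred independently.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest-scoring pairs first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy corpus, already tokenised on whitespace.
toy_corpus = (
    "the european union translates laws . "
    "the european union translates contracts . "
    "the parliament debates the laws ."
).split()

for pair, score in bigram_pmi(toy_corpus):
    print(pair, round(score, 2))
```

On a realistic corpus the same counts would come from millions of tokens, and a higher `min_count` would usually be needed to keep rare pairs from dominating the ranking.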


Similar resources

A new model for Persian multi-part words edition based on statistical machine translation

Multi-part words in English are hyphenated, with the hyphen separating the different parts. Persian has multi-part words as well. Based on Persian morphology, a half-space character is needed to separate the parts of multi-part words, but in many cases people incorrectly use a space character instead of the half-space character. This common incorrect use of space leads to some s...


Extracting parallel corpora from comparable documents to improve translation quality in machine translation systems

Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...


Inclusion of large input corpora in Statistical Machine Translation

In recent years, the availability of large, parallel, bilingual corpora has gone untapped by the statistical machine learning community. The crux of the problem lies in the inherent linearity of the traditional machine-translation algorithms, which impedes easy inclusion of new, large input corpora. However, it has been speculated [1] that there exists a log-linear relationship between the trai...


Domain Adaptation for Statistical Machine Translation with Domain Dictionary and Monolingual Corpora

Statistical machine translation systems are usually trained on large amounts of bilingual text and monolingual text. In this paper, we propose a method to perform domain adaptation for statistical machine translation, where in-domain bilingual corpora do not exist. This method first uses out-of-domain corpora to train a baseline system and then uses in-domain translation dictionaries and in...


Comparability of Corpora in Human and Machine Translation

In this study, we demonstrate a negative result from a work on comparable corpora which forces us to address a problem of comparability in both human and machine translation. We state that it is not always defined similarly, and comparable corpora used in contrastive linguistics or human translation analysis cannot always be applied for statistical machine translation (SMT). So, we revise the d...


Collecting and Using Comparable Corpora for Statistical Machine Translation

Lack of sufficient parallel data for many languages and domains is currently one of the major obstacles to further advancement of automated translation. The ACCURAT project is addressing this issue by researching methods for improving machine translation systems using comparable corpora. In this paper we present tools and techniques developed in the ACCURAT project that allow additional dat...




Publication date: 2005